#position bias 21/09/2025
When LLMs Judge: Signals, Biases, and What Real Evaluation Should Look Like
'LLM-as-a-Judge systems show measurable biases and attack vulnerabilities; their agreement with humans is task-dependent. Practical evaluation favors trace-based outcome metrics, component-level tests, careful prompting, and ensembling for constrained tasks.'